feat(#990): Foundry V3 hosted-agents pilot end-to-end (with portal-visibility fixes)#1103
Implements end-to-end support for Azure AI Foundry V3 Hosted Agents (preview).
Framework (lib/):
- holiday_peak_lib.foundry_hosting: manifest loader (Pydantic v2),
env-var resolver, deploy wrapper over AIProjectClient.agents.create_version
with terminal-status polling and async helper. Azure SDK imports lazy.
- mount_hosted_agent: default prefix flipped to '' (Foundry gateway adds
/openai/v1/ externally and forwards to container '/responses').
- BaseRetailAgent.serve_hosted: default prefix updated.
Product (apps/inventory-health-check/):
- agent.hosted.yaml NEW: V3 registration manifest (kind: hosted,
responses 1.0.0, 19 env vars, gpt-5-nano + gpt-5 model resources).
- agent.yaml: tracking shape preserved, doc-only update referencing sibling.
- Dockerfile: --port ${UVICORN_PORT:-8000} for hosted-runtime portability.
- main.py: docstring updated to /responses.
Ops:
- scripts/ops/deploy_hosted_agent.py NEW: CLI runbook entry point.
Tests:
- 24 new tests across manifest loader, deploy wrapper, and hosted mount.
- Fleet-wide tests/ops/test_foundry_portal_tracking_manifests.py guardrail
preserved (27/27 pass).
- Targeted suite: 56 passed in 6.40s.
- pylint 9.78/10 on new module + CLI.
…ples for portal visibility
Four bugs identified by line-by-line comparison against the official MS Learn
`Deploy a hosted agent` doc and the `Microsoft/foundry-samples` repository
(after the scaffolding-only initial PR did not produce a visible agent):
1. `AIProjectClient` now built with `allow_preview=True` so the
`agents.create_version` V3 surface is actually exposed. Without the flag
the SDK silently routes to legacy assistants and the new agent never
materializes in the New Foundry portal.
2. Terminal status set tightened from
`{active,ready,succeeded,failed,error}` to the documented terminal set
`{active,failed,deleted}` with `active` as the only success terminal;
`deleting` is correctly treated as transient.
3. `apps/inventory-health-check/agent.hosted.yaml` no longer redeclares
the platform-injected `APPLICATIONINSIGHTS_CONNECTION_STRING`. The full
forbidden list is now documented inline:
`FOUNDRY_PROJECT_ENDPOINT`, `FOUNDRY_PROJECT_ARM_ID`,
`FOUNDRY_AGENT_NAME`, `FOUNDRY_AGENT_VERSION`,
`FOUNDRY_AGENT_SESSION_ID`, `APPLICATIONINSIGHTS_CONNECTION_STRING`.
Collisions on any of these cause `create_version` to reject the manifest.
4. Pilot manifest renamed `FOUNDRY_AGENT_NAME_FAST` / `_RICH` to
`FOUNDRY_AGENT_ID_FAST` / `_RICH` to match the runtime contract in
ADR-010 / `holiday_peak_lib.config._build_foundry_config`.
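A pre-flight check for the collision rule in item 3 could look like the minimal sketch below. The function name `find_reserved_collisions` and the dict-shaped `env` argument are illustrative assumptions, not part of `holiday_peak_lib`; only the six variable names are taken from the list above.

```python
# Hypothetical pre-flight check. Only the six names below come from the
# commit message; the function and its signature are assumptions.
RESERVED_ENV_VARS = frozenset({
    "FOUNDRY_PROJECT_ENDPOINT",
    "FOUNDRY_PROJECT_ARM_ID",
    "FOUNDRY_AGENT_NAME",
    "FOUNDRY_AGENT_VERSION",
    "FOUNDRY_AGENT_SESSION_ID",
    "APPLICATIONINSIGHTS_CONNECTION_STRING",
})


def find_reserved_collisions(env: dict) -> list:
    """Return manifest env-var names that collide with platform-injected ones."""
    return sorted(name for name in env if name in RESERVED_ENV_VARS)
```

Running such a check before `create_version` would turn a remote rejection into a local error message.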
Loader (`load_manifest`) also probes `agent.manifest.yaml` first (the
canonical name used by `foundry-samples` and `azd ai agent init -m`)
before falling back to `agent.hosted.yaml` and `agent.yaml`, so future
services may adopt either name without changes to the loader.
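The filename-priority probing described above can be sketched as follows. This is an illustration under assumptions: `find_manifest` and `MANIFEST_CANDIDATES` are hypothetical names, and the real `load_manifest` also parses and validates the file, which is omitted here.

```python
from pathlib import Path
from typing import Optional

# Probe order from the commit message: canonical name first, then fallbacks.
MANIFEST_CANDIDATES = ("agent.manifest.yaml", "agent.hosted.yaml", "agent.yaml")


def find_manifest(service_dir: Path) -> Optional[Path]:
    """Return the first candidate manifest that exists in service_dir, else None."""
    for name in MANIFEST_CANDIDATES:
        candidate = service_dir / name
        if candidate.is_file():
            return candidate
    return None
```

Because the order is a plain tuple, a service can switch filenames without any loader change.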
Tests: 18 hosting tests pass; one new test covers `deleted` status, one
verifies `allow_preview=True` is passed to `AIProjectClient`, two cover
the loader filename-priority ordering. Pylint 9.78/10.
…ationError)
Foundry V3 hosted-agents platform reserves the entire FOUNDRY_*/AGENT_* env-var namespaces (per container-image-spec). Live create_version returned: 'Environment variable FOUNDRY_AGENT_ID_FAST is reserved for platform use.' Rename in-container env vars to HPH_AGENT_ID_FAST/_RICH (and matching HPH_AGENT_NAME_*). build_foundry_config now reads HPH_ first with FOUNDRY_AGENT_ID_* fallback so AKS deploys remain back-compat. Operator env contract unchanged: external ${FOUNDRY_AGENT_ID_FAST} is mapped to HPH_AGENT_ID_FAST inside the container via manifest placeholder substitution.
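The `HPH_`-first lookup with a `FOUNDRY_AGENT_ID_*` fallback might be sketched like this. The helper name `resolve_agent_id` and its signature are assumptions for illustration; the actual logic lives in `build_foundry_config` and may differ in detail.

```python
import os
from typing import Mapping, Optional


def resolve_agent_id(role: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Look up the agent id for a role such as "FAST" or "RICH".

    Prefers the container-safe HPH_ name and falls back to the legacy
    FOUNDRY_ name that AKS deployments still set.
    """
    for prefix in ("HPH_AGENT_ID_", "FOUNDRY_AGENT_ID_"):
        value = env.get(prefix + role.upper())
        if value:  # an empty string is treated as unset (an assumption)
            return value
    return None
```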
…ctive'
Foundry SDK 2.1.0 returns status as an AgentVersionStatus enum whose str() is
'AgentVersionStatus.FAILED'. The previous str().lower() produced
'agentversionstatus.failed', which never matched the terminal sets. Add a
_normalize_status helper that prefers the enum .value and falls back to
stripping the dotted Enum.MEMBER prefix. Three new tests cover all paths.
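A minimal sketch of the normalization described above, assuming the status arrives as an enum member, as its dotted string form, or as a plain string. `normalize_status` and `FakeStatus` are illustrative stand-ins, not the module's `_normalize_status` or the SDK enum.

```python
from enum import Enum

TERMINAL_STATUSES = frozenset({"active", "failed", "deleted"})


def normalize_status(status: object) -> str:
    """Reduce an enum member, its dotted str() form, or a plain string to lowercase."""
    value = getattr(status, "value", None)
    if isinstance(value, str):  # enum member: prefer its .value
        return value.lower()
    # "AgentVersionStatus.FAILED" -> "failed"; plain strings pass through.
    return str(status).rsplit(".", 1)[-1].lower()


class FakeStatus(Enum):  # stand-in for the SDK enum, for illustration only
    FAILED = "failed"
```

With this in place, `normalize_status(status) in TERMINAL_STATUSES` gives the same answer for all three input shapes.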
…convention
The azure-ai-agentserver-core framework reads the canonical PORT env var
(default 8088) via resolve_port(), but our containers were only listening
on UVICORN_PORT. This caused Foundry V3 hosted-agent invocations to
return 424 session_not_ready because the gateway probed PORT=8088 while
uvicorn was bound to UVICORN_PORT.
Changes:
apps/inventory-health-check/src/Dockerfile
- CMD now reads ${PORT:-${UVICORN_PORT:-8088}} so Foundry V3 PORT
takes precedence, AKS keeps UVICORN_PORT=8000 as legacy override,
and the framework default of 8088 is the fallback.
apps/inventory-health-check/agent.hosted.yaml
- Add PORT=8088 (canonical Foundry V3), UVICORN_PORT=8088 (alignment),
and WEB_CONCURRENCY=1 to keep startup under readiness deadline.
lib/src/holiday_peak_lib/app_factory.py
- _service_lifespan now emits six explicit lifespan_* log lines for
trace correlation in App Insights.
lib/src/holiday_peak_lib/foundry_hosting/deploy.py
- _extract handles collections.abc.Mapping (AgentVersionDetails is a
MutableMapping, not a dict, and exposes fields via __getitem__).
- Add _pick_latest_version tolerant of v3, 3, 3.1.0 label shapes.
lib/tests/test_foundry_hosting_deploy.py
- 479 new lines covering Mapping branch, picker, and re-fetch path.
memories/session/foundry-v3-pilot-status.md
- Resume-state notes: PORT root cause, namespace collision,
lifespan-mount behavior, ACR drift correction.
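One way a picker could tolerate the v3, 3, and 3.1.0 label shapes mentioned for deploy.py is sketched below. The function names and the "highest numeric tuple wins" rule are assumptions; the commit message does not spell out how `_pick_latest_version` compares labels.

```python
import re
from typing import Iterable, Optional, Tuple


def parse_version_label(label: str) -> Optional[Tuple[int, ...]]:
    """Turn 'v3', '3', or '3.1.0' into a comparable tuple; None if unparseable."""
    match = re.fullmatch(r"[vV]?(\d+(?:\.\d+)*)", label.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def pick_latest_version(labels: Iterable[str]) -> Optional[str]:
    """Return the label with the highest parsed version, skipping unparseable ones."""
    best_key: Optional[Tuple[int, ...]] = None
    best_label: Optional[str] = None
    for label in labels:
        key = parse_version_label(label)
        if key is not None and (best_key is None or key > best_key):
            best_key, best_label = key, label
    return best_label
```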
Refs #990. PR #1103.
Foundry V3 hosted-agents reject "PORT" with "invalid_payload: Environment
variable 'PORT' is reserved for platform use". The Dockerfile CMD already
reads the platform-injected value first, then UVICORN_PORT, then 8088, so
removing PORT here lets the platform inject its own value automatically.
Keep UVICORN_PORT for local docker-run / AKS dev parity.
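For readers less familiar with nested shell defaults, the precedence in `${PORT:-${UVICORN_PORT:-8088}}` can be mirrored in Python as below. `resolve_listen_port` is a hypothetical helper written for this illustration; it is not the `resolve_port()` from azure-ai-agentserver-core.

```python
import os
from typing import Mapping


def resolve_listen_port(env: Mapping[str, str] = os.environ, default: int = 8088) -> int:
    """Mirror ${PORT:-${UVICORN_PORT:-8088}}: unset or empty values fall through."""
    for name in ("PORT", "UVICORN_PORT"):
        raw = env.get(name, "")
        if raw:
            return int(raw)
    return default
```

The `:-` form treats an empty variable the same as an unset one, which is why the sketch checks truthiness rather than key presence.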
Also refresh memories/session/foundry-v3-pilot-status.md with the runbook
proven during the pilot:
1. ACR azureAdAuthenticationAsArmPolicy must be enabled
2. AI-account system MI needs AcrPull + Container Registry Repository
Reader on the canonical ACR (not only the project MI)
3. Per-version agent MI and blueprint MI need Foundry User on the
project when deploying via the SDK (azd auto-handles this; SDK
path does not)
v20 of inventory-health-check is now active and returns 200 from
/responses with a structured domain answer.
Refs: #990
Follow-up issues filed for the operational findings discovered during root-cause analysis of v15-v19
PR body updated with the new findings and end-to-end invocation evidence (v20 active, status=completed, Foundry storage POST -> 201). The three operational fixes are already applied to the live |
…treaming
Foundry's ResponsesHostServer (agent-framework-foundry-hosting==1.0.0a260507)
calls agent.run with two distinct contracts depending on stream:
    stream=False -> response = await agent.run(stream=False, ...)      # coroutine
    stream=True  -> async for update in agent.run(stream=True, ...):   # iterator
Our adapter was marked async def run, so it always returned a coroutine.
When the Foundry portal Playground (which always sets stream=True) hit the
adapter, the framework tried to async-iterate the coroutine and crashed with:
    'async for' requires an object with __aiter__, got coroutine.
Fix: reshape run() into a synchronous dispatcher that returns either a
coroutine (_run_once -> AgentResponse) or an async iterator
(_run_streaming -> AgentResponseUpdate) based on the stream flag. The
streaming path emits a single AgentResponseUpdate carrying one text content --
sufficient for the SSE tracker to render and terminate the stream cleanly.
Per-token streaming via invoke_model_stream remains a follow-up.
Tests:
- Replaced test_hosted_run_adapter_refuses_streaming with
  test_hosted_run_adapter_streams_single_update (pins the async-iterator
  contract)
- Added test_hosted_run_adapter_non_streaming_returns_awaitable to pin the
  awaitable contract for stream=False
- All 12 hosted-adapter tests pass; 1360 lib tests pass; 3 pilot tests pass
Refs: PR #1103
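The dispatcher shape described in this commit can be illustrated with a toy adapter. This is a sketch under assumptions: `EchoAdapter` and its helpers are hypothetical, and plain strings stand in for `AgentResponse` and `AgentResponseUpdate`.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Union


class EchoAdapter:
    """Toy adapter: run() is a plain def, so it can return either shape."""

    def run(self, prompt: str, *, stream: bool = False) -> Union[Awaitable[str], AsyncIterator[str]]:
        if stream:
            return self._run_streaming(prompt)  # async generator object, has __aiter__
        return self._run_once(prompt)  # coroutine object, awaitable

    async def _invoke(self, prompt: str) -> str:
        await asyncio.sleep(0)  # stand-in for the real model call
        return f"echo: {prompt}"

    async def _run_once(self, prompt: str) -> str:
        return await self._invoke(prompt)

    async def _run_streaming(self, prompt: str) -> AsyncIterator[str]:
        # Single-update stream: one chunk carrying the whole reply.
        yield await self._invoke(prompt)


async def demo() -> tuple:
    adapter = EchoAdapter()
    once = await adapter.run("ping", stream=False)
    chunks = [chunk async for chunk in adapter.run("ping", stream=True)]
    return once, chunks
```

Had `run` been declared `async def`, the `async for` in `demo` would fail with the same `__aiter__` error quoted above, because calling it would always produce a coroutine.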
Streaming-protocol fix landed:
Consolidate the manual `az role assignment create` runbook step into
`deploy_hosted_agent_version`. The per-version managed identity minted by
`AIProjectClient.agents.create_version` does NOT receive the Foundry User
role on the project scope automatically, so every Playground / Responses
invocation fails 401 on the storage POST:
Foundry storage POST .../storage/responses?api-version=v1 -> 401
Principal does not have access to API/Operation.
The `azd` and VS-Code extension deploy paths grant it implicitly; the SDK
path (this module) did not, leaving operators to remember a manual step
that was easy to skip. This change closes the loop in code.
Implementation
--------------
* `deploy_hosted_agent_version` accepts `auto_grant_role` (default True),
`foundry_role_name` ("Foundry User"), `project_scope` (optional override),
`role_granter` + `scope_resolver` (test seams).
* On reaching `active`, the helper resolves the per-version principal id
from `version_obj.instance_identity.principal_id` (with two preview-era
field aliases), derives the project ARM scope from `project_endpoint`
via `az resource list`, and calls `az role assignment create` with the
`--assignee-principal-type ServicePrincipal` flag — matching the manual
runbook one-for-one.
* The grant is idempotent: `RoleAssignmentExists` from the Azure CLI is
treated as success and recorded as `status=already_exists`.
* A failed grant does NOT mask a successful version activation. The
failure is captured in `result.extras["role_grant"]` with `status=failed`
and `error=<stderr>` so operators can re-run or escalate.
* `scripts/ops/deploy_hosted_agent.py` exposes `--no-auto-grant-foundry-user`,
`--foundry-role-name`, and `--project-scope` CLI flags. The JSON output
now includes the `role_grant` payload.
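The idempotent grant and the "failure does not mask activation" rule above can be sketched as a single function with an injectable runner. Assumptions: `grant_role`, the `runner` seam, and the returned dict shape are illustrative; passing the principal through `--assignee-object-id` is also an assumption, since the commit message only names the `--assignee-principal-type ServicePrincipal` flag. Unlike the module's `_grant_role_via_az`, which raises on real failures, this sketch folds the failure into the returned payload.

```python
import json
import subprocess
from typing import Callable, Dict, List

Runner = Callable[[List[str]], subprocess.CompletedProcess]


def _run_az(cmd: List[str]) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


def grant_role(principal_id: str, scope: str, role: str = "Foundry User",
               runner: Runner = _run_az) -> Dict[str, str]:
    """Grant a role and report the outcome as a payload instead of raising."""
    cmd = ["az", "role", "assignment", "create",
           "--assignee-object-id", principal_id,
           "--assignee-principal-type", "ServicePrincipal",
           "--role", role, "--scope", scope, "--output", "json"]
    proc = runner(cmd)
    if proc.returncode == 0:
        assignment_id = json.loads(proc.stdout or "{}").get("id", "")
        return {"status": "granted", "assignment_id": assignment_id}
    if "RoleAssignmentExists" in (proc.stderr or ""):
        return {"status": "already_exists"}  # idempotent: treated as success
    return {"status": "failed", "error": (proc.stderr or "").strip()}
```

A caller would store the returned dict under a key such as `role_grant` next to the activation result, so a failed grant stays visible without changing the `active` outcome.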
Tests
-----
* +12 tests in `lib/tests/test_foundry_hosting_deploy.py`:
- principal-id extraction (3 shapes: `instance_identity`, `managed_identity`,
`Mapping`) + missing-id null path
- scope derivation: resolver test seam, malformed endpoint, no-account
- integration: granted / skipped / already-exists / failure / no-principal
/ explicit-scope-override
- default `_grant_role_via_az`: success (parses assignment id),
already-exists (idempotent), real-failure (raises)
* All 1376 lib tests + 108 pilot tests pass.
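The principal-id extraction exercised by the first group of tests might look like the sketch below. It is written under assumptions: `extract_principal_id` and `_get` are hypothetical names, and only the `managed_identity` alias is shown because it is the only alias named in the test list.

```python
from collections.abc import Mapping
from types import SimpleNamespace
from typing import Any, Optional

# "instance_identity" comes from the description above; "managed_identity"
# is inferred from the test names. Both are assumptions about object shape.
IDENTITY_FIELDS = ("instance_identity", "managed_identity")


def _get(obj: Any, key: str) -> Any:
    """Read key from a Mapping by lookup, otherwise as an attribute."""
    if isinstance(obj, Mapping):
        return obj.get(key)
    return getattr(obj, key, None)


def extract_principal_id(version_obj: Any) -> Optional[str]:
    """Return the first principal id found under a known identity field, else None."""
    for field in IDENTITY_FIELDS:
        identity = _get(version_obj, field)
        if identity is None:
            continue
        principal_id = _get(identity, "principal_id")
        if principal_id:
            return str(principal_id)
    return None


# Usage with an attribute-style object standing in for the SDK response.
example = SimpleNamespace(instance_identity=SimpleNamespace(principal_id="1234"))
found = extract_principal_id(example)
```

Returning `None` rather than raising lets the caller record a skipped grant when no principal id is available.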
Refs: #1107, runbook docs/ops/foundry-hosted-agents.md
Fix #8 landed:
What changed
Verification
Behaviour after merge
Operators no longer need to follow verify-step 4 from the original PR body. This unblocks the v21 (…
233016e to
02b912d
#1107 live validation update: hosted Redis/Event Hub isolation
Pushed commit
What changed
Live deployment
Live Responses API validation
The prior Playground-style failure mode (HTTP 200 SSE starts, then hangs without completion until timeout) did not reproduce on v24. Validation gates
UI route-segment bundle budgets
Advisory at v1 (does not block PRs). Strict mode activates after the F1 cleanup follow-up trims dead-weight deps from the global path. Budgets live in |
Summary
Pilot end-to-end Foundry V3 Hosted Agents for `inventory-health-check`, validated by a successful HTTP 200 invocation against the public Responses endpoint in the `aipholidaris` project, for both `stream=false` (curl `ping`) and `stream=true` (Foundry portal Playground, after `881b49a8`).

This PR also lands the eight live-deployment fixes discovered while running the pilot against the platform, all of which now have regression tests. Fixes #1–#5 came from the initial activation track; #6 (the `PORT` reserved-name regression) was found in the previous session; #7 (the streaming-protocol contract) was found this session when the Foundry portal Playground surfaced an `'async for' requires __aiter__, got coroutine` TypeError that the original `ping` test (`stream=false`) had not exercised. #8 codifies the Foundry User role auto-grant in `scripts/ops/deploy_hosted_agent.py` (closing #1107) so the manual `az role assignment create` runbook step is no longer required.

End-to-end invocation evidence (final state)
App Insights trace for the same invocation:
This proves the full V3 hosted-agent lifecycle: deploy → activate → invoke → container → agent code → Foundry storage → response → client, all healthy in the `stream=false` path. The streaming path is validated by the new hosted-adapter unit tests (`test_hosted_run_adapter_streams_single_update` and `test_hosted_run_adapter_non_streaming_returns_awaitable`) and will be re-verified against Foundry once v21+ is deployed with `881b49a8`.

- `holidaypeakhub405devacr.azurecr.io/inventory-health-check:foundry-v3` (digest `sha256:5b9d8601…`): `ImageError`, root-caused to `azureADAuthenticationAsArmPolicy=disabled` on canonical ACR
- `holidaypeakhub405devacr.azurecr.io/inventory-health-check@sha256:d4775cdf…` (tag `foundry-v6`, build run `cj28`): invoked end-to-end (non-streaming)
- `881b49a8` streaming-protocol fix for Playground/SSE invocations

Live fixes landed in this PR

| # | Problem | Fix | Regression tests |
|---|---|---|---|
| 1 | `AIProjectClient` rejected hosted manifests with HTTP 400 because the SDK requires preview opt-in. | `_build_project_client` constructs `AIProjectClient(..., allow_preview=True)`; agent IDs are looked up by name through `agents.get_version(...)` (legacy ID URL is gone in V3). | `test_build_project_client_passes_allow_preview` |
| 2 | … `agent.yaml` / `agent.manifest.yaml` were tried). | `manifest.py` loader now probes `agent.manifest.yaml` -> `agent.hosted.yaml` -> `agent.yaml` so a hosted-only manifest can sit alongside the metadata-only `agent.yaml` without changing the portal-tracking contract. | |
| 3 | `agent.hosted.yaml` declared the protocol version field that V3 rejects. | … `template.kind: hosted` shape with `protocols: [{protocol: responses, version: "1.0.0"}]` and `container.cpu/memory` instead of nested `definition.*`. | |
| 4 | `FOUNDRY_*` / `AGENT_*` were rejected by `create_version` with `ValidationError: ... reserved per container-image-spec`. The platform reserves the entire `FOUNDRY_*` and `AGENT_*` namespaces, not just the six documented platform-injected names. | … `HPH_AGENT_ID_FAST` / `HPH_AGENT_ID_RICH` (and `HPH_AGENT_NAME_*`). `holiday_peak_lib.app_factory_components.foundry_lifecycle.build_foundry_config` now reads the `HPH_` prefix first and falls back to the legacy `FOUNDRY_AGENT_ID_*` / `FOUNDRY_AGENT_NAME_*` for AKS deploys (back-compat). | `test_build_foundry_config_prefers_hph_agent_id_over_foundry_agent_id`, `test_build_foundry_config_hph_agent_name_takes_precedence` |
| 5 | … `"status": "failed"` into an `AgentVersionStatus` enum whose `str()` returns `"AgentVersionStatus.FAILED"`. The previous `str(status).lower()` produced `"agentversionstatus.failed"`, which did not match `_TERMINAL_STATUSES = {"active","failed","deleted"}`, so the script timed out instead of raising `RuntimeError`. | … `_normalize_status` in `lib/src/holiday_peak_lib/foundry_hosting/deploy.py` that prefers the enum `.value` field and falls back to stripping any `Enum.MEMBER` dotted prefix. | `test_normalize_status_handles_enum_value`, `test_normalize_status_strips_dotted_enum_repr`, `test_normalize_status_plain_string` |
| 6 | … `PORT=8088` declaration with `invalid_payload: Environment variable 'PORT' is reserved for platform use`. The reserved namespace expanded between the time we wrote the pilot manifest and the final activation. | `apps/inventory-health-check/agent.hosted.yaml` no longer declares `PORT`. The existing Dockerfile CMD (`${PORT:-${UVICORN_PORT:-8088}}`) picks up the platform-injected value first, then falls back to `UVICORN_PORT` (still declared) for local docker-run / AKS dev parity. | |
| 7 | `ResponsesHostServer` (preview SDK `agent-framework-foundry-hosting==1.0.0a260507`) calls `agent.run` with two distinct contracts depending on `stream`: `await agent.run(stream=False, ...)` expects a coroutine returning `AgentResponse`; `async for update in agent.run(stream=True, ...):` expects an async iterator of `AgentResponseUpdate` items. Our `_HostedAgentRunAdapter.run` was marked `async def`, so it always returned a coroutine. The Playground (which defaults to `stream=true`) triggered `TypeError: 'async for' requires an object with __aiter__, got coroutine` at upstream `_responses.py:341`, which cascaded into a 401 on the persistence write because the response object had already been created server-side. The `ping` test passed because it used the non-streaming path. | `lib/src/holiday_peak_lib/agents/hosted.py` reshapes `_HostedAgentRunAdapter.run` into a synchronous dispatcher that returns either a coroutine (`_run_once` → `AgentResponse`) or an async iterator (`_run_streaming` yielding one `AgentResponseUpdate(contents=[Content(type='text', text=…)])`) based on the `stream` flag. The shared `_invoke_handle` helper keeps the translation/dispatch/extraction logic in one place. Per-token streaming via `invoke_model_stream` remains a follow-up. | `test_hosted_run_adapter_streams_single_update` (pins async-iterator contract), `test_hosted_run_adapter_non_streaming_returns_awaitable` (pins awaitable contract) |
| 8 (new) | The Foundry SDK deploy path (`scripts/ops/deploy_hosted_agent.py`) did not grant the `Foundry User` role to the per-version managed identity minted by `create_version`. Without that role, the container ran fine but the Foundry runtime returned 401 on `POST .../storage/responses` and the Playground surfaced a generic 'internal error storing the response' toast. Manual `az role assignment create` was the workaround. | `deploy_hosted_agent_version` now auto-resolves the per-version `instance_identity.principal_id`, derives the project ARM scope from `project_endpoint` via `az resource list`, and calls `az role assignment create --assignee-principal-type ServicePrincipal --role 'Foundry User'`. Idempotent on `RoleAssignmentExists`. New CLI flags: `--no-auto-grant-foundry-user`, `--foundry-role-name`, `--project-scope`. Failure does NOT mask a successful version activation; it is recorded in `result.extras['role_grant']`. Closes #1107. | `test_deploy_auto_grants_foundry_user_after_active`, `test_deploy_skips_grant_when_auto_grant_disabled`, `test_deploy_records_already_exists_when_granter_returns_none`, `test_deploy_surfaces_grant_failure_without_breaking_active`, `test_deploy_records_skipped_when_principal_id_missing`, `test_deploy_uses_explicit_project_scope_override`, `test_grant_role_via_az_treats_already_exists_as_idempotent`, `test_grant_role_via_az_raises_on_real_failure`, `test_grant_role_via_az_parses_assignment_id_on_success`, `test_extract_principal_id_from_instance_identity`, `test_extract_principal_id_from_managed_identity_alias`, `test_extract_principal_id_from_mapping` |

Three operational findings (project- and ACR-side, not in this PR's code; followed up in #1107 and #1108)
These are environment-level prerequisites discovered while activating v20. They are already applied to the live `dev` environment and documented in memories/session/foundry-v3-pilot-status.md. They are out of scope for this PR but tracked for codification:
1. `policies.azureADAuthenticationAsArmPolicy.status` must be `enabled` on the canonical registry. If disabled, the platform rejects the ARM→ACR token exchange and surfaces a generic `ImageError` with zero pull attempts recorded on the ACR. This was the single root cause of v15–v19 failures.
2. … `351cdb70-…`) needs `AcrPull` and `Container Registry Repository Reader` on the canonical ACR. The docs name only the project MI; the live behaviour requires both.
3. … `scripts/ops/deploy_hosted_agent.py`), the per-version agent MI minted by `create_version` does not get `Foundry User` on the project automatically (the `azd` / VS-Code extension path does). Without it, the container runs and the agent code succeeds, but storage `POST /storage/responses` returns 401 and the public call surfaces as HTTP 500 (or, in the Playground, "An internal error occurred while storing the response"). `az role assignment create` after `create_version` is the runbook today.

How to verify in this branch (updated)
Expected: `status=active` from step 3, `status=completed` from step 5 with the structured agent response in both the non-streaming (curl) and streaming (Playground) paths.

Out of scope (tracked elsewhere)
- `deploy_hosted_agent.py` should auto-grant `Foundry User` on the new per-version MI after `create_version`, eliminating the manual step in verify-step 4.
- … `azureADAuthenticationAsArmPolicy=enabled`, AI-account MI `AcrPull` + `Container Registry Repository Reader`) in IaC so they survive a registry rebuild.
- … `invoke_model_stream` (currently the streaming path emits one `AgentResponseUpdate` chunk carrying the full reply; the SSE tracker handles it correctly, but richer token-by-token rendering remains a follow-up).
- … `holiday-peak-db` / `enterprise-memory` Cosmos containers with hosted-agent specific containers (current pilot reuses existing infra).